Learning in Restless Bandits Under Exogenous Global Markov Process

Abstract

We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, where an exogenous global Markov process governs the reward distribution of each arm. Under each global state, the reward process of each arm evolves according to an unknown Markovian rule, which is non-identical among different arms. At each time, a player chooses one arm out of $N$ arms to play and receives a random reward drawn from a finite set of reward states. The arms are restless, that is, their local state evolves regardless of the player's actions. Motivated by recent studies on related RMAB settings, the regret is defined as the reward loss with respect to a player that knows the dynamics of the problem and plays at each time $t$ the arm that maximizes the expected immediate value. The objective is to develop an arm-selection policy that minimizes the regret. To that end, we develop the Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP theoretically and establish a finite-sample bound on the regret, showing that LEMP achieves a logarithmic regret order with time. We further analyze LEMP numerically and present simulation results that support the theoretical findings and demonstrate that LEMP significantly outperforms alternative algorithms.
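To make the setting concrete, below is a minimal simulation sketch of the model described in the abstract, not of the LEMP algorithm itself. All sizes, transition matrices, and reward values are illustrative assumptions (e.g., $N = 3$ arms, two global states, two local reward states), a uniform-random learner stands in for LEMP, and the benchmark follows one plausible reading of the abstract's oracle: a player that knows all dynamics and current states and plays the arm with the highest expected immediate reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
N_ARMS = 3       # number of arms N
N_GLOBAL = 2     # states of the exogenous global Markov process
N_LOCAL = 2      # reward states of each arm's local chain
T = 10_000       # horizon

# Exogenous global Markov process, shared by all arms.
P_global = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

# Per-arm local transition matrices, one per global state; non-identical
# among arms. Shape: (arm, global state, local state, next local state).
P_local = rng.dirichlet(np.ones(N_LOCAL), size=(N_ARMS, N_GLOBAL, N_LOCAL))

# Reward attached to each local (reward) state.
reward_values = np.array([0.0, 1.0])

def markov_step(state, P):
    """Sample the next state of a Markov chain with transition matrix P."""
    return rng.choice(len(P), p=P[state])

g = 0                                   # current global state
local = np.zeros(N_ARMS, dtype=int)     # current local state of each arm
regret = 0.0

for t in range(T):
    # Expected immediate reward of each arm given the current states,
    # which the omniscient benchmark player can compute.
    exp_reward = np.array([P_local[i, g, local[i]] @ reward_values
                           for i in range(N_ARMS)])
    oracle_arm = int(np.argmax(exp_reward))

    # Placeholder learner: uniform random play (stands in for LEMP).
    played_arm = int(rng.integers(N_ARMS))

    regret += exp_reward[oracle_arm] - exp_reward[played_arm]

    # Restlessness: every arm's local chain evolves regardless of which
    # arm was played, and the global state evolves exogenously.
    local = np.array([markov_step(local[i], P_local[i, g])
                      for i in range(N_ARMS)])
    g = markov_step(g, P_global)

print(f"Cumulative regret of the random learner over T={T} steps: {regret:.1f}")
```

Under this setup the oracle's per-step expected reward upper-bounds that of any learner, so the accumulated gap is exactly the regret the paper seeks to minimize; LEMP's guarantee is that this quantity grows only logarithmically with time.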

Similar Articles

Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after $T$ steps achieves $\tilde{O}(\sqrt{T})$ regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we sho...

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

In this paper we consider the problem of learning the optimal dynamic policy for uncontrolled restless bandit problems. In an uncontrolled restless bandit problem, there is a finite set of arms, each of which when played yields a non-negative reward. There is a player who sequentially selects one of the arms at each time step. The goal of the player is to maximize its undiscounted reward over a...

Competing Bandits: Learning Under Competition

Most modern systems strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We initiate a study of the interplay between exploration and competition—how such systems balance the exploration for learning and the competition for users. Here the users play three distinct roles: they are customers...

Opportunistic Scheduling as Restless Bandits

In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and differ across users; also, the cost incurred, i...

Modeling Human Performance in Restless Bandits with Particle Filters

Bandit problems provide an interesting and widely-used setting for the study of sequential decision-making. In their most basic form, bandit problems require people to choose repeatedly between a small number of alternatives, each of which has an unknown rate of providing reward. We investigate restless bandit problems, where the distributions of reward rates for the alternatives change over ti...

Journal

Journal title: IEEE Transactions on Signal Processing

Year: 2022

ISSN: 1053-587X, 1941-0476

DOI: https://doi.org/10.1109/tsp.2022.3224790